[not for land] DDP regression tests by HDCharles · Pull Request #2613 · vllm-project/llm-compressor

HDCharles · 2026-04-13T15:26:07Z

Summary

Comprehensive test framework comparing pre-DDP (2ab02443) vs post-DDP (0bc916e5) AWQ quality across 5 models, 3 quantization formats, and 3 evaluation benchmarks
Example-style quantization scripts for each model with custom AWQ mappings where needed (Mixtral MoE w1/w2/w3, Llama-4-Scout multi-submodel scoping)
Bash harness (run_all_tests.sh) with resumable execution, progressive results table, and automatic model cleanup

Test Matrix

Models: Llama-3-8B, Qwen3-VL-8B, Llama-4-Scout-17B, Qwen2.5-32B, Mixtral-8x7B
Schemes: W4A16_ASYM, W8A8, W8A16
Benchmarks: gsm8k (5-shot), wikitext (0-shot), mmlu (5-shot)
Code states: pre-DDP, post-DDP

Usage

./run_all_tests.sh 2>&1 | tee regression_results.log
python extract_log_summary.py regression_results.log

Test plan

Smoke tested quantization at both commits with multiple schemes
Verified vLLM eval with chat template and fewshot_as_multiturn
Verified AWQ mapping patterns match correct layers for all models
Confirmed eval skip and model cleanup logic works

🤖 Generated with Claude Code

Comprehensive test framework comparing pre-DDP vs post-DDP AWQ quality across 5 models, 3 quantization formats (W4A16_ASYM, W8A8, W8A16), and 3 benchmarks (gsm8k, wikitext, mmlu) via vLLM lm_eval. Models: Llama-3-8B, Qwen3-VL-8B, Llama-4-Scout-17B, Qwen2.5-32B, Mixtral-8x7B Features: - Skips quantization/eval if results already exist (resumable) - Cleans up quantized models after eval to save disk - vLLM eval fallback chain (TP -> expert_parallel -> TP=1 -> eager -> hf) - Applies chat template and fewshot_as_multiturn for eval - Progressive results summary table with extracted metrics - Custom AWQ mappings for Mixtral (w1/w2/w3) and Llama-4-Scout (scoped to language_model) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces a comprehensive regression test suite for AWQ quantization, specifically designed to compare pre-DDP and post-DDP code states. It includes individual test scripts for Llama 3, Llama 4 Scout, Mixtral, Qwen 2.5, and Qwen 3 VL, alongside a log extraction utility and an automation script. Feedback focuses on improving the portability and efficiency of the automation script, resolving a documentation contradiction regarding model cleanup, and ensuring the suite actually exercises DDP code paths by correctly configuring multi-GPU quantization.

gemini-code-assist · 2026-04-13T15:27:26Z

+# ── Helper: activate environment ─────────────────────────────────────────────
+
+activate_env() {
+    source /home/HDCharles/rhdev/bin/activate


The virtual environment activation path is hardcoded to a specific user's home directory (/home/HDCharles/...). This makes the script non-portable and will cause it to fail for any other user or in a CI environment.

gemini-code-assist · 2026-04-13T15:27:26Z

+MODEL_VLLM_ARGS=(
+    "2048,1,1"
+    "4096,1,1"
+    "4096,2,1"
+    "4096,2,1"
+    "2048,2,1"
+)


The num_gpus_quant parameter (the third value in each entry of MODEL_VLLM_ARGS) is set to 1 for all models. This means the script never executes the torchrun path (line 279) and only performs single-GPU quantization. This prevents the "DDP regression test suite" from actually exercising Distributed Data Parallel (DDP) code paths in llmcompressor, which is the primary objective of this PR.

gemini-code-assist · 2026-04-13T15:27:26Z

+#   ./run_all_tests.sh 2>&1 | tee regression_results.log
+#   python extract_log_summary.py regression_results.log
+#
+# Models are saved to disk and NOT cleaned up for follow-up evaluation.


There is a contradiction between this header comment, which states that models are "NOT cleaned up", and the implementation at line 335, which explicitly removes the quantized model directory using rm -rf "$save_dir".

gemini-code-assist · 2026-04-13T15:27:26Z

+    for scheme in "${SCHEMES[@]}"; do
+        for code_state in "${CODE_STATES[@]}"; do


The current loop structure is inefficient. It iterates through models and schemes, and for each combination, it switches the git branch and re-installs the package (pip install -e .). This results in 30 separate installations (5 models * 3 schemes * 2 states). Reordering the loops to checkout each code state once and run all applicable models/schemes for that state would significantly reduce execution time.

coderabbitai · 2026-04-13T15:30:18Z

Summary by CodeRabbit

New Features

Added AWQ quantization example scripts for Llama 3, Llama 4 Scout, Mixtral, Qwen 2.5 32B, and Qwen 3 VL models with configurable quantization schemes and calibration options.

Tests

Added regression test infrastructure including a log analyzer utility and comprehensive test runner for multi-model evaluation.

Walkthrough

Introduces a comprehensive AWQ quantization regression test suite with model-specific quantization scripts for Llama 3, Llama 4 Scout, Mixtral, Qwen 2.5-32B, and Qwen 3 Vision-Language models, along with automated testing orchestration and result summarization utilities.

Changes

Cohort / File(s)	Summary
AWQ Quantization Regression Scripts `examples/awq/regression_tests/llama3_awq.py`, `examples/awq/regression_tests/llama4_scout_awq.py`, `examples/awq/regression_tests/mixtral_awq.py`, `examples/awq/regression_tests/qwen25_32b_awq.py`, `examples/awq/regression_tests/qwen3_vl_awq.py`	Model-specific quantization test scripts implementing CLI argument parsing (`--scheme`, `--save-dir`, `--num-samples`), dataset loading/preprocessing with model-specific chat templates, AWQModifier construction with targeted ignore patterns and custom mappings, quantization execution via `oneshot()`, and model generation testing followed by compressed model saving.
Testing Automation & Orchestration `run_all_tests.sh`	Bash orchestration script performing multi-model DDP regression testing across multiple schemes and code states, managing quantization execution via git checkout, coordinating evaluation tasks with vLLM, tracking results in CSV, implementing fallback evaluation strategies, and producing summary statistics.
Log Processing Utility `extract_log_summary.py`	Python script parsing structured regression log sections to extract timing, GPU memory, and quantization metrics (GSM8K Flex/Strict, Wiki Perplexity, MMLU accuracy), generating formatted summary tables with pre/post-DDP comparisons and delta calculations.

Sequence Diagram

sequenceDiagram
    participant User
    participant Script as AWQ Script
    participant Model
    participant Dataset
    participant AWQModifier
    participant Quantizer
    participant Generator

    User->>Script: Execute with scheme & num_samples
    Script->>Model: Load pretrained model & tokenizer
    Script->>Dataset: Load calibration dataset
    Script->>Dataset: Preprocess (chat template & tokenize)
    Script->>AWQModifier: Construct recipe with scheme & ignore rules
    Script->>Quantizer: Execute oneshot(dataset, recipe, config)
    Quantizer->>Model: Apply quantization
    Quantizer-->>Script: Return quantized model
    Script->>Model: Dispatch for execution
    Script->>Generator: Generate sample text
    Generator-->>Script: Return generated tokens
    Script-->>User: Save quantized model & report metrics

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

awq, enhancement, qwen, llama

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 23.81% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title '[not for land] DDP regression tests' is directly related to the changeset, which adds a comprehensive DDP regression test suite with quantization scripts and a bash harness.
Description check	✅ Passed	The description clearly explains the purpose of the PR: a regression test framework comparing pre-DDP and post-DDP AWQ quality, including test matrix, models, schemes, benchmarks, usage instructions, and test plan verification.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch awq-ddp-regression-tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-04-13T15:35:43Z

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

coderabbitai

Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/awq/regression_tests/llama3_awq.py`:
- Around line 29-30: The model-loading call using
AutoModelForCausalLM.from_pretrained currently passes dtype="auto" which is
ignored; change the parameter name to torch_dtype="auto" so the model is loaded
with the correct precision (e.g., update the call in llama3_awq.py where
AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto") is used and make
the identical change in qwen25_32b_awq.py for its
AutoModelForCausalLM.from_pretrained call; keep the
AutoTokenizer.from_pretrained(..., trust_remote_code=True) calls unchanged.

In `@examples/awq/regression_tests/llama4_scout_awq.py`:
- Around line 19-128: Run ruff format on the file to fix the style errors
reported by CI: run `ruff format` (or your editor's ruff formatting) and stage
the resulting changes; focus on formatting inside the main() block and helper
functions (main, preprocess_function, data_collator, and the
AWQModifier/AWQMapping recipe) so the file passes `ruff format --check` and the
CI quality job turns green. Ensure you do not alter logic—only whitespace, line
breaks, and import/style formatting—then commit the formatted file.

In `@examples/awq/regression_tests/qwen25_32b_awq.py`:
- Around line 29-30: The model is being loaded with the wrong keyword argument:
replace the incorrect dtype="auto" used in AutoModelForCausalLM.from_pretrained
with the correct torch_dtype="auto" so the 32B model loads in the intended
precision; update the call to AutoModelForCausalLM.from_pretrained(MODEL_ID,
torch_dtype="auto") (keeping AutoTokenizer.from_pretrained(MODEL_ID,
trust_remote_code=True) unchanged).

In `@examples/awq/regression_tests/qwen3_vl_awq.py`:
- Around line 103-127: The smoke test currently pulls a remote image URL inside
the messages list (the "image": "http://images.cocodataset.org/...") which
introduces a network dependency that can fail before quantization is saved;
update the messages construction so it uses a local/test image already present
in the harness (e.g., reuse an existing dataset image buffer or path used
earlier) or make the generation step best-effort by wrapping the
vision-processing/generation sequence (process_vision_info, processor(...),
model.generate) in a try/except that logs the failure and continues without
marking the quantization as failed; target the messages variable,
process_vision_info invocation, the processor(...) call and the
model.generate(...) call when applying the change.

In `@extract_log_summary.py`:
- Around line 41-50: The current code searches the whole content for the
MODEL/CODE STATE pattern so repeats reuse the first global match; instead
restrict the search to the current scheme's block: first locate the scheme
header for this iteration (e.g., find the scheme-start boundary in content),
then run re.search(pattern, scheme_block, re.DOTALL) so match, start,
next_section and body are computed relative to that scheme_block (not the full
content); update the logic around pattern, match, start, next_section and body
to slice content to the scheme range before extracting the MODEL and CODE STATE
section.

In `@run_all_tests.sh`:
- Around line 267-269: The cleanup currently removes $save_dir unconditionally
which defeats the earlier quantization skip; update the flow in run_all_tests.sh
so that before doing any quantization/eval work you check for the presence of
the final evaluation outputs (results_*.json) and, if they already exist, skip
the entire quantize+eval block (the code guarded by the $save_dir existence
check and the later removal code around lines referencing results_*.json and the
rm -rf $save_dir cleanup); alternatively, defer deleting $save_dir until after
you confirm all eval outputs have been produced—ensure this change touches the
quantize check that uses $save_dir and the later cleanup block that deletes
$save_dir so resumable runs do not force re-quantization.
- Around line 58-60: The activate_env() function currently hardcodes a personal
path (/home/HDCharles/rhdev/bin/activate); change it to not assume that path by
either: 1) removing the function and documenting that the caller must activate
the virtualenv, or 2) making activate_env() read a venv path from an environment
variable (e.g. VENV_ACTIVATE or VENV_PATH) or a CLI argument, validate that the
file exists before sourcing, and emit a clear error and exit if not set/invalid;
update any callers to pass the variable or rely on the caller-activated
environment.
- Around line 204-215: The current switch_code_state() function checks out
historical commits in-place which removes the test harness and newly added
regression scripts (examples/awq/regression_tests/*); change the approach to
avoid in-place checkouts by either (A) using git worktree or a fresh clone:
create a separate worktree/clone for PRE_DDP_COMMIT and POST_DDP_COMMIT, run pip
install -e . in those isolated directories and invoke the regression scripts
from those worktrees, or (B) before any checkout, copy the harness and
examples/awq/regression_tests/* to a temporary directory and run the tests from
that safe copy; update switch_code_state() and the test invocation logic to
reference the worktree/temporary copy instead of the current working tree so the
scripts introduced by the PR are never removed.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 85ec6662-14e8-47f9-9c74-4a521f89aaa5

📥 Commits

Reviewing files that changed from the base of the PR and between 9d328ed and c13fce8.

📒 Files selected for processing (7)

examples/awq/regression_tests/llama3_awq.py
examples/awq/regression_tests/llama4_scout_awq.py
examples/awq/regression_tests/mixtral_awq.py
examples/awq/regression_tests/qwen25_32b_awq.py
examples/awq/regression_tests/qwen3_vl_awq.py
extract_log_summary.py
run_all_tests.sh

coderabbitai · 2026-04-13T15:39:16Z

+    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Check the actual code at the specified lines echo "=== Checking llama3_awq.py lines 25-35 ===" cat -n examples/awq/regression_tests/llama3_awq.py | sed -n '25,35p' echo "" echo "=== Checking qwen25_32b_awq.py ===" if [ -f "examples/awq/regression_tests/qwen25_32b_awq.py" ]; then grep -n "AutoModelForCausalLM.from_pretrained" examples/awq/regression_tests/qwen25_32b_awq.py else echo "File not found" fi echo "" echo "=== Checking llama4_scout_awq.py ===" if [ -f "examples/awq/regression_tests/llama4_scout_awq.py" ]; then grep -n "AutoModelForCausalLM.from_pretrained" examples/awq/regression_tests/llama4_scout_awq.py else echo "File not found" fi

Repository: vllm-project/llm-compressor

Length of output: 697

🌐 Web query:

AutoModelForCausalLM.from_pretrained dtype parameter torch_dtype hugging face transformers

💡 Result:

The torch_dtype parameter in AutoModelForCausalLM.from_pretrained specifies the PyTorch data type (dtype) for loading the model weights, overriding the default torch.float32. It accepts a torch.dtype (e.g., torch.float16, torch.bfloat16) or the string "auto". When "auto" is used: - Checks config.json for a torch_dtype or dtype entry. - If absent, infers from the dtype of the first floating-point weight in the checkpoint. Examples: import torch from transformers import AutoModelForCausalLM # Load with specific dtype model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16) # Auto-detect dtype model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype="auto") Note: torch_dtype primarily affects parameters; some buffers with explicit dtypes (e.g., torch.float32) remain unchanged. In recent versions (post-v4.XX), it is also referred to as dtype in examples, but torch_dtype remains valid. This parameter helps reduce memory usage (e.g., FP16/BF16) and is passed via kwargs to the underlying PreTrainedModel.from_pretrained method.

Citations:

1: https://hf.co/docs/transformers/models

2: https://huggingface.co/docs/transformers/v4.53.2/models

3: https://huggingface.co/transformers/v4.10.1/main_classes/model.html

4: https://huggingface.co/docs/transformers/v4.36.1/en/main_classes/model

5: https://huggingface.co/transformers/v4.11.3/main_classes/model.html

6: from_pretrained torch_dtype DO NOT affect model buffers huggingface/transformers#30709

7: torch_dtype='auto' is not working when using AutoModel.from_pretrained(...) huggingface/transformers#24774

Change dtype parameter to torch_dtype in model-loading calls.

Line 29 uses dtype="auto" but the Transformers API requires torch_dtype="auto". Using the incorrect parameter name causes it to be ignored, loading the model at default precision (float32) instead of auto-detecting the optimal dtype. This undermines the regression test by producing invalid memory and timing measurements.

The same issue exists in examples/awq/regression_tests/qwen25_32b_awq.py on line 29.

Suggested changes

- model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto") + model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

Apply the same fix to qwen25_32b_awq.py.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/awq/regression_tests/llama3_awq.py` around lines 29 - 30, The model-loading call using AutoModelForCausalLM.from_pretrained currently passes dtype="auto" which is ignored; change the parameter name to torch_dtype="auto" so the model is loaded with the correct precision (e.g., update the call in llama3_awq.py where AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto") is used and make the identical change in qwen25_32b_awq.py for its AutoModelForCausalLM.from_pretrained call; keep the AutoTokenizer.from_pretrained(..., trust_remote_code=True) calls unchanged.

coderabbitai · 2026-04-13T15:39:17Z

+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--scheme", default="W4A16_ASYM")
+    parser.add_argument("--save-dir", default=None)
+    parser.add_argument("--num-samples", type=int, default=NUM_CALIBRATION_SAMPLES)
+    args = parser.parse_args()
+
+    num_samples = args.num_samples
+
+    model = Llama4ForConditionalGeneration.from_pretrained(MODEL_ID, dtype="auto")
+    processor = Llama4Processor.from_pretrained(MODEL_ID)
+
+    ds = load_dataset(
+        DATASET_ID, name="LLM", split=f"train[:{num_samples}]"
+    )
+
+    def preprocess_function(example):
+        messages = []
+        for message in example["messages"]:
+            messages.append(
+                {
+                    "role": message["role"],
+                    "content": [{"type": "text", "text": message["content"]}],
+                }
+            )
+
+        return processor.apply_chat_template(
+            messages,
+            return_tensors="pt",
+            padding=False,
+            truncation=True,
+            max_length=MAX_SEQUENCE_LENGTH,
+            tokenize=True,
+            add_special_tokens=False,
+            return_dict=True,
+            add_generation_prompt=False,
+        )
+
+    ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names)
+
+    def data_collator(batch):
+        assert len(batch) == 1
+        return {
+            key: (
+                torch.tensor(value)
+                if key != "pixel_values"
+                else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
+            )
+            for key, value in batch[0].items()
+        }
+
+    # Llama-4-Scout has both vision_model and language_model sub-models,
+    # so mappings must be scoped to language_model to avoid dual matches.
+    # The main experts use a fused gate_up_proj (not Linear), so only
+    # shared_expert Linear layers are AWQ targets.
+    recipe = AWQModifier(
+        targets="Linear",
+        scheme=args.scheme,
+        ignore=[
+            "re:.*lm_head",
+            "re:.*self_attn",
+            "re:.*router",
+            "re:.*vision_model.*",
+            "re:.*multi_modal_projector.*",
+            "Llama4TextAttention",
+        ],
+        mappings=[
+            AWQMapping(
+                "re:.*language_model.*post_attention_layernorm$",
+                [
+                    "re:.*shared_expert.gate_proj$",
+                    "re:.*shared_expert.up_proj$",
+                ],
+            ),
+            AWQMapping(
+                "re:.*shared_expert.up_proj$",
+                ["re:.*shared_expert.down_proj$"],
+            ),
+        ],
+    )
+
+    torch.cuda.reset_peak_memory_stats()
+    start_time = time.time()
+
+    oneshot(
+        model=model,
+        dataset=ds,
+        recipe=recipe,
+        max_seq_length=MAX_SEQUENCE_LENGTH,
+        num_calibration_samples=num_samples,
+        data_collator=data_collator,
+        sequential_targets=["Llama4TextMLP"],
+    )
+
+    elapsed_time = time.time() - start_time
+    peak_memory_gb = torch.cuda.max_memory_allocated() / (1024**3)
+    print("Quantization Complete")
+    print(f"Time: {elapsed_time / 60:.2f} minutes ({elapsed_time:.2f} seconds)")
+    print(f"Peak GPU Memory: {peak_memory_gb:.2f} GB")
+
+    save_dir = args.save_dir or (
+        MODEL_ID.rstrip("/").split("/")[-1] + f"-{args.scheme}"
+    )
+    model.save_pretrained(save_dir, save_compressed=True)
+    processor.save_pretrained(save_dir)
+    print(f"Model saved to {save_dir}")
+
+
+if __name__ == "__main__":
+    main()


⚠️ Potential issue | 🟡 Minor

Run ruff format on this file before merge.

CI is already red on examples/awq/regression_tests/llama4_scout_awq.py due to ruff format --check, so this file needs a formatting pass to get the quality job green.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/awq/regression_tests/llama4_scout_awq.py` around lines 19 - 128, Run ruff format on the file to fix the style errors reported by CI: run `ruff format` (or your editor's ruff formatting) and stage the resulting changes; focus on formatting inside the main() block and helper functions (main, preprocess_function, data_collator, and the AWQModifier/AWQMapping recipe) so the file passes `ruff format --check` and the CI quality job turns green. Ensure you do not alter logic—only whitespace, line breaks, and import/style formatting—then commit the formatted file.

coderabbitai · 2026-04-13T15:39:17Z

+    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
+    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Check the actual content at lines 29-30 in the file cat -n examples/awq/regression_tests/qwen25_32b_awq.py | sed -n '25,35p'

Repository: vllm-project/llm-compressor

Length of output: 504

🏁 Script executed (no clone):

#!/bin/bash # Search for Hugging Face Transformers documentation on from_pretrained parameter names # Check what the correct parameter should be curl -s "https://huggingface.co/docs/transformers/v4.47.1/en/main_classes/model#transformers.PreTrainedModel.from_pretrained" 2>/dev/null | grep -i "torch_dtype\|dtype" | head -20 || echo "Could not fetch docs directly"

Length of output: 23243

🏁 Script executed:

#!/bin/bash # Check for other usages of from_pretrained in the codebase to see patterns rg "from_pretrained.*\(dtype|torch_dtype)" examples/ --type py -A 2 -B 2

Repository: vllm-project/llm-compressor

Length of output: 195

🏁 Script executed:

#!/bin/bash # Also check the llama3 file mentioned in the scratchpad to see if it has the same issue cat -n examples/awq/regression_tests/llama3_awq.py | sed -n '25,35p'

Repository: vllm-project/llm-compressor

Length of output: 504

🏁 Script executed:

#!/bin/bash # Search for other from_pretrained usages to see what pattern is typically used rg "from_pretrained\(" examples/ --type py -A 1 | head -50

Repository: vllm-project/llm-compressor

Length of output: 3643

🏁 Script executed:

#!/bin/bash # Check if there are any successful uses of torch_dtype in the repo rg "torch_dtype" --type py

Repository: vllm-project/llm-compressor

Length of output: 2328

Use the correct torch_dtype parameter for model loading.

Line 29 uses dtype="auto", but Transformers' from_pretrained() API expects torch_dtype="auto". The incorrect parameter name is silently ignored, causing the 32B model to load in the wrong precision and invalidating regression test results.

Suggested change

- model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto") + model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/awq/regression_tests/qwen25_32b_awq.py` around lines 29 - 30, The model is being loaded with the wrong keyword argument: replace the incorrect dtype="auto" used in AutoModelForCausalLM.from_pretrained with the correct torch_dtype="auto" so the 32B model loads in the intended precision; update the call to AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto") (keeping AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True) unchanged).

coderabbitai · 2026-04-13T15:39:17Z

+    messages = [
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "image",
+                    "image": "http://images.cocodataset.org/train2017/000000231895.jpg",
+                },
+                {"type": "text", "text": "Please describe the animal in this image\n"},
+            ],
+        }
+    ]
+    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
+    image_inputs, video_inputs = process_vision_info(messages)
+    inputs = processor(
+        text=[prompt],
+        images=image_inputs,
+        videos=video_inputs,
+        padding=False,
+        max_length=MAX_SEQUENCE_LENGTH,
+        truncation=True,
+        return_tensors="pt",
+    ).to(model.device)
+    output = model.generate(**inputs, max_new_tokens=100)
+    print(processor.decode(output[0], skip_special_tokens=True))


⚠️ Potential issue | 🟠 Major

Remove the network dependency from the smoke test.

Lines 109-124 pull a remote image for the post-quantization sanity check, and this runs before Lines 133-135 save the quantized artifact. In the harness, a failure here marks the whole block as QUANT_FAILED, so a flaky network can discard a successful quantization for reasons unrelated to AWQ. Reuse one of the already-loaded dataset images or make this generation step best-effort instead.
As per coding guidelines, examples/**/*.py: "Ensure examples follow best practices and are runnable."

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@examples/awq/regression_tests/qwen3_vl_awq.py` around lines 103 - 127, The smoke test currently pulls a remote image URL inside the messages list (the "image": "http://images.cocodataset.org/...") which introduces a network dependency that can fail before quantization is saved; update the messages construction so it uses a local/test image already present in the harness (e.g., reuse an existing dataset image buffer or path used earlier) or make the generation step best-effort by wrapping the vision-processing/generation sequence (process_vision_info, processor(...), model.generate) in a try/except that logs the failure and continues without marking the quantization as failed; target the messages variable, process_vision_info invocation, the processor(...) call and the model.generate(...) call when applying the change.

coderabbitai · 2026-04-13T15:39:17Z

+        # Find the body for this section
+        pattern = re.escape(f"MODEL: {model}") + r".*?" + re.escape(f"CODE STATE: {code_state}")
+        match = re.search(pattern, content, re.DOTALL)
+        if not match:
+            continue
+        # Get everything after the match until the next section
+        start = match.end()
+        next_section = re.search(r"║\s+MODEL:", content[start:])
+        end = start + next_section.start() if next_section else len(content)
+        body = content[start:end]


⚠️ Potential issue | 🔴 Critical

This re-search picks the wrong section body for repeated schemes.

Lines 42-49 search the full log again using only MODEL and CODE STATE. Each model/code-state pair appears once per scheme, so every later scheme will reuse the first matching block and get parsed with the wrong metrics. That makes the summary table incorrect for repeated model/scheme combinations.

Suggested fix

- sections = section_pattern.findall(content) + sections = list(section_pattern.finditer(content)) @@ - for model, scheme, code_state in sections: - model = model.strip() - scheme = scheme.strip() - code_state = code_state.strip() - - # Find the body for this section - pattern = re.escape(f"MODEL: {model}") + r".*?" + re.escape(f"CODE STATE: {code_state}") - match = re.search(pattern, content, re.DOTALL) - if not match: - continue - # Get everything after the match until the next section - start = match.end() - next_section = re.search(r"║\s+MODEL:", content[start:]) - end = start + next_section.start() if next_section else len(content) - body = content[start:end] + for idx, match in enumerate(sections): + model, scheme, code_state = (part.strip() for part in match.groups()) + start = match.end() + end = sections[idx + 1].start() if idx + 1 < len(sections) else len(content) + body = content[start:end]

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@extract_log_summary.py` around lines 41 - 50, The current code searches the whole content for the MODEL/CODE STATE pattern so repeats reuse the first global match; instead restrict the search to the current scheme's block: first locate the scheme header for this iteration (e.g., find the scheme-start boundary in content), then run re.search(pattern, scheme_block, re.DOTALL) so match, start, next_section and body are computed relative to that scheme_block (not the full content); update the logic around pattern, match, start, next_section and body to slice content to the scheme range before extracting the MODEL and CODE STATE section.

coderabbitai · 2026-04-13T15:39:17Z

+activate_env() {
+    source /home/HDCharles/rhdev/bin/activate
+}


⚠️ Potential issue | 🟠 Major

Don't hardcode a personal virtualenv path.

Line 59 assumes /home/HDCharles/rhdev/bin/activate exists. Everywhere except that workstation, source fails before any regression runs start. Let the caller activate the environment first, or take the venv path from an env var / CLI argument.

🧰 Tools

🪛 Shellcheck (0.11.0)

[info] 59-59: Not following: /home/HDCharles/rhdev/bin/activate was not specified as input (see shellcheck -x).

(SC1091)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@run_all_tests.sh` around lines 58 - 60, The activate_env() function currently hardcodes a personal path (/home/HDCharles/rhdev/bin/activate); change it to not assume that path by either: 1) removing the function and documenting that the caller must activate the virtualenv, or 2) making activate_env() read a venv path from an environment variable (e.g. VENV_ACTIVATE or VENV_PATH) or a CLI argument, validate that the file exists before sourcing, and emit a clear error and exit if not set/invalid; update any callers to pass the variable or rely on the caller-activated environment.

coderabbitai · 2026-04-13T15:39:17Z

+switch_code_state() {
+    local state=$1
+
+    if [ "$state" == "pre-ddp" ]; then
+        echo "Switching to pre-DDP code ($PRE_DDP_COMMIT)..."
+        git checkout "$PRE_DDP_COMMIT" 2>&1
+        pip install -e . --quiet 2>&1
+    elif [ "$state" == "post-ddp" ]; then
+        echo "Switching to post-DDP code ($POST_DDP_COMMIT)..."
+        git checkout "$POST_DDP_COMMIT" 2>&1
+        pip install -e . --quiet 2>&1
+    fi


⚠️ Potential issue | 🔴 Critical

Checking out the target commits here removes the very scripts you're about to run.

This harness runs from the current PR branch, but Lines 207-214 switch the repo to historical commits before Lines 278-283 invoke examples/awq/regression_tests/*.py. Because those regression scripts are introduced by this PR, the checkout will remove or replace them from the working tree, so the subsequent quantization command can fail with “file not found” or run the wrong script contents. Use separate worktrees/clones for the pre/post code states, or copy the harness scripts outside the repo before switching revisions.

Also applies to: 278-283

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@run_all_tests.sh` around lines 204 - 215, The current switch_code_state() function checks out historical commits in-place which removes the test harness and newly added regression scripts (examples/awq/regression_tests/*); change the approach to avoid in-place checkouts by either (A) using git worktree or a fresh clone: create a separate worktree/clone for PRE_DDP_COMMIT and POST_DDP_COMMIT, run pip install -e . in those isolated directories and invoke the regression scripts from those worktrees, or (B) before any checkout, copy the harness and examples/awq/regression_tests/* to a temporary directory and run the tests from that safe copy; update switch_code_state() and the test invocation logic to reference the worktree/temporary copy instead of the current working tree so the scripts introduced by the PR are never removed.

coderabbitai · 2026-04-13T15:39:17Z

+            # ── Quantize (skip if model already exists) ────────────
+            if [ -d "$save_dir" ] && [ -f "$save_dir/config.json" ]; then
+                echo "Quantized model already exists at $save_dir, skipping quantization."


⚠️ Potential issue | 🟠 Major

The cleanup step defeats the resumable quantization skip.

Lines 268-269 only skip quantization while $save_dir still exists, but Lines 333-336 delete it unconditionally after every block. On a resumed run, the harness will therefore re-quantize every model/scheme/state even when Lines 309-317 are about to skip all evaluations because results_*.json already exist. If resumability is a goal, skip the whole block up front when all eval outputs are present, or keep the quantized model until that block is fully reusable.

Also applies to: 309-317, 332-336

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@run_all_tests.sh` around lines 267 - 269, The cleanup currently removes $save_dir unconditionally which defeats the earlier quantization skip; update the flow in run_all_tests.sh so that before doing any quantization/eval work you check for the presence of the final evaluation outputs (results_*.json) and, if they already exist, skip the entire quantize+eval block (the code guarded by the $save_dir existence check and the later removal code around lines referencing results_*.json and the rm -rf $save_dir cleanup); alternatively, defer deleting $save_dir until after you confirm all eval outputs have been produced—ensure this change touches the quantize check that uses $save_dir and the later cleanup block that deletes $save_dir so resumable runs do not force re-quantization.

mergify · 2026-04-13T15:50:00Z

The quality checks have failed. Please run make style and make quality under
the root directory to adddress the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

HDCharles changed the title ~~[AWQ] Add DDP regression test suite~~ [not for land] DDP regression tests Apr 13, 2026

gemini-code-assist Bot reviewed Apr 13, 2026

View reviewed changes

coderabbitai Bot added enhancement New feature or request qwen For any PR / issue related to Qwen support llama For any PR / issue related to Llama herd support awq For any issue / PR related to AWQ support labels Apr 13, 2026

coderabbitai Bot reviewed Apr 13, 2026

View reviewed changes

mergify Bot added the documentation Improvements or additions to documentation label Apr 13, 2026

mergify Bot added the quality-failed label Apr 13, 2026

		for scheme in "${SCHEMES[@]}"; do
		for code_state in "${CODE_STATES[@]}"; do

		model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype="auto")
		tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

Conversation

HDCharles commented Apr 13, 2026

Summary

Test Matrix

Usage

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

New Features

Tests

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Suggested labels

❌ Failed checks (1 warning)

Uh oh!

github-actions Bot commented Apr 13, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

mergify Bot commented Apr 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai Bot commented Apr 13, 2026 •

edited

Loading